SYSTEM AND METHOD FOR ANNOTATING MULTI-MODAL
CHARACTERISTICS IN MULTIMEDIA DOCUMENTS

Field of the Invention 

The present invention relates to the computer processing of multimedia files.
More specifically, the present invention relates to the manual annotation of
multi-modal events, objects, scenes, and audio occurring in multimedia files.

Background of the Invention 

Multimedia content is becoming more common both on the World Wide Web
and on local computers. As the corpus of multimedia content increases, the indexing of
features within the content becomes more and more important. Observing both audio
and video simultaneously and annotating that observation results in a higher
confidence level in the identification of the features.

Existing multimedia tools provide capabilities to annotate either audio or
video separately, but not as a whole. (An example of a video-only annotation tool is
the IBM MPEG7 Annotation Tool, inventors J. Smith et al., available through
http://www.alphaworks.ibm.com/tech/videoannex.) Other conventional
arrangements are described in: Park et al., "iMEDIA-CAT: Intelligent Media Content
Annotation Tool", Proc. International Conference on Inductive Modeling (ICIM)
2001, South Korea, November, 2001; and Minka et al., "Interactive Learning using a
Society of Models", Pattern Recognition, Vol. 30, pp. 565, 1997, TR #349.

It has long been recognized that annotating video or audio features in isolation
results in less confidence in the identification of the features.

In view of the foregoing, a need has been recognized in connection with
providing improved systems and methods for observing and annotating multi-modal
events, objects, scenes, and audio occurring in multimedia files.

Summary of the Invention 

In accordance with at least one presently preferred embodiment of the present
invention, there are broadly contemplated multimedia annotation systems and
methods that permit users to observe solely video, video with audio, solely audio, or
audio with video and to annotate what has been observed.

In one embodiment, there is provided a computer system which has one or
more multimedia files that are stored in a working memory. The multi-modal
annotation process displays a user-selected multimedia file, permits the selection of a
mode or modes to observe the file content, annotates the observations, and saves the
annotations in a working memory (such as an MPEG-7 XML file).
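
As a rough illustration of the final saving step, the following Python sketch serializes a list of annotations to XML. The element and attribute names (`Annotations`, `Segment`, `Label`, `Keywords`) and the file name are hypothetical placeholders chosen for readability; they are not taken from the MPEG-7 schema, and an actual MPEG-7 description would need to follow that standard's description schemes.

```python
import xml.etree.ElementTree as ET

def annotations_to_xml(media_name, annotations):
    """Build a simplified XML tree from annotation dicts (assumed layout, not MPEG-7)."""
    root = ET.Element("Annotations", media=media_name)
    for ann in annotations:
        seg = ET.SubElement(root, "Segment",
                            mode=ann["mode"],
                            start=str(ann["start"]),
                            end=str(ann["end"]))
        # One <Label> element per checked box.
        for label in ann["labels"]:
            ET.SubElement(seg, "Label").text = label
        # Free-text keywords are optional.
        if ann.get("keywords"):
            ET.SubElement(seg, "Keywords").text = ann["keywords"]
    return ET.ElementTree(root)

example = [
    {"mode": "video_with_audio", "start": 0.0, "end": 4.2,
     "labels": ["explosion", "outdoors"], "keywords": "night scene"},
]
tree = annotations_to_xml("clip01.mpg", example)
xml_text = ET.tostring(tree.getroot(), encoding="unicode")
```

Calling `tree.write("clip01_annotations.xml", encoding="utf-8", xml_declaration=True)` would then persist the result to a file.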

In summary, one aspect of the invention provides an apparatus for managing
multimedia content, the apparatus comprising: an arrangement for supplying
multimedia content; an input interface for permitting the selection, for observation, of
at least one of the following modes associated with the multimedia content: an audio
portion that includes video; and a video portion that includes audio; and an
arrangement for annotating observations of a selected mode.

A further aspect of the invention provides a method of managing multimedia
content, the method comprising the steps of: supplying multimedia content;
permitting the selection, for observation, of at least one of the following modes
associated with the multimedia content: an audio portion that includes video; and a
video portion that includes audio; and annotating observations of a selected mode.

Furthermore, an additional aspect of the invention provides a program storage
device readable by machine, tangibly embodying a program of instructions executable
by the machine to perform method steps for managing multimedia content, the method
comprising the steps of: supplying multimedia content; permitting the selection, for
observation, of at least one of the following modes associated with the multimedia
content: an audio portion that includes video; and a video portion that includes audio;
and annotating observations of a selected mode.

For a better understanding of the present invention, together with other and
further features and advantages thereof, reference is made to the following
description, taken in conjunction with the accompanying drawings, and the scope of
the invention will be pointed out in the appended claims.

Brief Description of the Drawings

Figure 1 is a block diagram depicting a multi-modal annotation system.

Figure 2 is an illustration of a system annotating video scenes, objects, and
events.

Figure 3 is an illustration of a system annotating audio with video.

Figure 4 is an illustration of a system annotating audio without video.

Description of the Preferred Embodiments 

Figure 1 is a block diagram of one preferred embodiment of a multi-modal
annotation system in accordance with the present invention. The multimedia content
and previous annotations are stored on the storage medium 100. When a user 130
selects a multimedia file via the annotation tool from the storage medium 100, it is
loaded into working memory 110 and portions of it are displayed in the annotation tool
120. At any time, the user 130 may also request that previously saved annotations
associated with the current multi-modal file be loaded from the storage medium 100
into working memory 110. The user 130 views the multimedia data by making
requests through the annotation tool 120. The user 130 then annotates his
observations and the annotation tool 120 saves these annotations in working memory
110. The user can at any time request the annotation tool 120 to save the annotations on
the storage medium 100.
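
The data flow of Fig. 1 can be sketched as follows. This is a minimal illustration only: the class name `AnnotationSession`, the use of a JSON sidecar file, and the `.annotations.json` suffix are assumptions made for the example and are not details given in the description.

```python
import json
import tempfile
from pathlib import Path

class AnnotationSession:
    """Hypothetical stand-in for annotation tool 120 with data held in working memory 110."""

    def __init__(self, storage_dir):
        self.storage = Path(storage_dir)   # plays the role of storage medium 100
        self.media_name = None
        self.annotations = []              # annotations held in working memory

    def _sidecar(self):
        # Assumed naming convention for the saved-annotation file.
        return self.storage / (self.media_name + ".annotations.json")

    def open_media(self, media_name):
        self.media_name = media_name
        self.annotations = []

    def load_saved_annotations(self):
        # Loaded only on request, mirroring the "at any time" wording.
        path = self._sidecar()
        if path.exists():
            self.annotations = json.loads(path.read_text(encoding="utf-8"))

    def annotate(self, observation):
        self.annotations.append(observation)

    def save(self):
        self._sidecar().write_text(json.dumps(self.annotations, indent=2),
                                   encoding="utf-8")

with tempfile.TemporaryDirectory() as storage:
    first = AnnotationSession(storage)
    first.open_media("clip01.mpg")
    first.annotate({"shot": 3, "events": ["explosion"]})
    first.save()

    # A later session on the same file can request the saved annotations.
    second = AnnotationSession(storage)
    second.open_media("clip01.mpg")
    second.load_saved_annotations()
    restored = second.annotations
```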

Figure 2 is an illustration of a system annotating video scenes, objects, and
events. (Simultaneous reference should also be made to Fig. 1.) The multimedia data
has been loaded from the storage medium 100 into working memory 110. A video tab
290 has been selected. The multimedia video has been segmented into shots using
scene change detection. A shot list window 200 displays a portion of the shots
in the multimedia. Here, the user 130 has selected a shot 210 which is highlighted in
the shot list window 200. A key frame 220, which is a representative frame among the
frames of a shot, is preferably displayed. In addition, the frames of the shot may be
viewed in the video window 230 using play controls 240. The video can be viewed
with or without audio depending upon the selection of a mute button 250. The user
130 may select annotations for this shot by clicking the boxes in the events 260, static
scenes 270, or key objects 280 lists of check boxes. Any significant observations which are
not contained in the check boxes can be noted in a keywords text box 300.
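
One way to picture the record produced for a shot in Fig. 2 is sketched below. The vocabulary entries and field names are illustrative assumptions; the description does not enumerate the contents of the events 260, static scenes 270, or key objects 280 lists.

```python
from dataclasses import dataclass, field

# Hypothetical check-box vocabularies; the actual lists are not given in the text.
EVENTS = {"explosion", "person speaking", "sports play"}
STATIC_SCENES = {"indoors", "outdoors", "cityscape"}
KEY_OBJECTS = {"car", "face", "animal"}

@dataclass
class ShotAnnotation:
    shot_index: int
    muted: bool                      # state of the mute button while observing
    events: set = field(default_factory=set)
    static_scenes: set = field(default_factory=set)
    key_objects: set = field(default_factory=set)
    keywords: str = ""               # free text for anything not in the lists

    def check(self, category, label):
        """Tick one box, rejecting labels that are not in that category's list."""
        vocab = {"events": EVENTS, "static_scenes": STATIC_SCENES,
                 "key_objects": KEY_OBJECTS}[category]
        if label not in vocab:
            raise ValueError(f"{label!r} is not a {category} check box; use keywords")
        getattr(self, category).add(label)

ann = ShotAnnotation(shot_index=4, muted=False)
ann.check("events", "explosion")
ann.check("key_objects", "car")
ann.keywords = "smoke rising behind the car"
```

Observations outside the fixed lists go into `keywords`, mirroring the role of the keywords text box.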

Figure 3 is an illustration of the system annotating audio with video.
(Simultaneous reference should also be made to Fig. 1.) The multimedia data has
been loaded from the storage medium 100 into working memory 110. The audio with
video tab 370 has been selected. The multimedia video has been segmented into shots
using scene change detection. The shot list window 200 displays a portion of the
shots in the multimedia. The shot 210 associated with the current audio position is
highlighted in the shot list window 200. The audio data is displayed in the window
390. A segment of audio 340 has been delimited for annotation; that is, the limits or
bounds of the audio have been fixed for subsequent annotation. The video associated
with the audio is shown in the video window 230. As the user 130 uses the play
controls 360, the audio data display 390 is updated to display the current audio data
and the video window 230 changes to reflect the current video frame. Thus, the user
130 may observe the video and simultaneously hear the audio while making audio
annotations. The user 130 preferably uses the buttons 350 to delimit audio segments.
Check boxes corresponding to the foreground sounds 320 (the most prominent sounds
in the segment) and background sounds 330 (sounds which are present but are
secondary to other sounds) may be checked to indicate sounds heard within the audio
segment 340. Any significant observations which are not contained in the check boxes
can be noted in keywords text box 300.
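
Keeping the highlighted shot in step with the audio position amounts to looking up the current playback time in the list of shot start times. The sketch below shows one possible way to do that with a binary search, together with a simple record for a delimited audio segment. The shot boundaries, function name, and field names are assumptions for illustration; the description does not specify how the lookup is performed.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

# Assumed shot start times in seconds, as produced by some scene change detector.
SHOT_STARTS = [0.0, 4.2, 9.8, 15.5]

def shot_for_position(position_s, shot_starts=SHOT_STARTS):
    """Return the index of the shot containing the given playback position."""
    if position_s < shot_starts[0]:
        raise ValueError("position precedes the first shot")
    return bisect_right(shot_starts, position_s) - 1

@dataclass
class AudioSegment:
    start_s: float
    end_s: float
    foreground: set = field(default_factory=set)   # most prominent sounds
    background: set = field(default_factory=set)   # secondary sounds
    keywords: str = ""

    def __post_init__(self):
        # A delimited segment needs fixed, ordered bounds.
        if self.end_s <= self.start_s:
            raise ValueError("segment end must come after its start")

segment = AudioSegment(start_s=5.0, end_s=8.5)
segment.foreground.add("speech")
segment.background.add("traffic")
highlighted = shot_for_position(segment.start_s)
```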

Figure 4 is an illustration of the system annotating audio without video.
(Simultaneous reference should also be made to Fig. 1.) The multimedia data has been
loaded from the storage medium 100 into working memory 110. Audio-without-video
tab 400 has been selected. The audio data is displayed in the window 390. A segment
of audio 340 has been delimited for annotation. As the user 130 uses the play controls
360, the audio data display 390 is updated to display the current audio data. Thus, the
user 130 hears only the audio while making audio annotations. The user 130 uses
the buttons 350 to delimit audio segments. The check boxes for foreground sounds
320 and background sounds 330 may be checked to indicate sounds heard within the
audio segment 340. Any significant observations which are not contained in the check
boxes can be noted in the keywords text box 300.
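
The three views described for Figs. 2 to 4, together with the mute button 250, yield the four observation modes named in the summary. A compact way to express what each mode presents is sketched below; the enum and function names are hypothetical and serve only to summarize the description.

```python
from enum import Enum

class Mode(Enum):
    VIDEO_ONLY = "video tab, muted"
    VIDEO_WITH_AUDIO = "video tab, unmuted"
    AUDIO_WITH_VIDEO = "audio with video tab"
    AUDIO_ONLY = "audio without video tab"

def presentation(mode):
    """Return (show_video, play_audio, annotation_target) for a mode."""
    table = {
        Mode.VIDEO_ONLY:       (True,  False, "shot"),
        Mode.VIDEO_WITH_AUDIO: (True,  True,  "shot"),
        Mode.AUDIO_WITH_VIDEO: (True,  True,  "audio segment"),
        Mode.AUDIO_ONLY:       (False, True,  "audio segment"),
    }
    return table[mode]

def video_tab_mode(mute_pressed):
    """On the video tab, the mute button chooses between the two video modes."""
    return Mode.VIDEO_ONLY if mute_pressed else Mode.VIDEO_WITH_AUDIO
```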

It is to be understood that the present invention, in accordance with at least one 
presently preferred embodiment, includes an arrangement for supplying multimedia 
content, an input interface for permitting the selection, for observation, of a mode 
associated with the multimedia content, and an arrangement for annotating 
observations of a selected mode. Together, these elements may be implemented on at 
least one general-purpose computer running suitable software programs. These may 
also be implemented on at least one Integrated Circuit or part of at least one Integrated 
Circuit. Thus, it is to be understood that the invention may be implemented in 
hardware, software, or a combination of both. 

If not otherwise stated herein, it is to be assumed that all patents, patent 
applications, patent publications and other publications (including web-based 
publications) mentioned and cited herein are hereby fully incorporated by reference 
herein as if set forth in their entirety herein. 

Although illustrative embodiments of the present invention have been
described herein with reference to the accompanying drawings, it is to be understood
that the invention is not limited to those precise embodiments, and that various other
changes and modifications may be effected therein by one skilled in the art without
departing from the scope or spirit of the invention.

Claims

What is claimed is: 

1. An apparatus for managing multimedia content, said apparatus comprising:

an arrangement for supplying multimedia content;

an input interface for permitting the selection, for observation, of at least one
of the following modes associated with the multimedia content: an audio portion that
includes video; and a video portion that includes audio; and

an arrangement for annotating observations of a selected mode.

2. The apparatus according to Claim 1, wherein said input interface permits
the selection, for observation, of both of the following associated with the multimedia
content: an audio portion that includes video; and a video portion that includes audio.

3. The apparatus according to Claim 1, wherein said input interface 
additionally permits the selection, for observation, of solely a video portion of 
multimedia content. 

4. The apparatus according to Claim 1, wherein said input interface
additionally permits the selection, for observation, of solely an audio portion of
multimedia content.

5. The apparatus according to Claim 1, wherein said arrangement for
supplying multimedia content comprises a working memory which stores multimedia
files.

6. The apparatus according to Claim 1, wherein said input interface is adapted
to: first permit the selection of a multimedia file and then permit the selection of said
at least one of: an audio portion simultaneously with video; and a video portion
simultaneously with audio.

7. The apparatus according to Claim 1, further comprising a working memory 
for saving the annotated observations of a selected mode. 

8. The apparatus according to Claim 1, wherein said input interface is adapted
to permit the selection, for observation, of at least the following mode associated with
the multimedia content: a video portion that includes audio.

9. The apparatus according to Claim 8, wherein said input interface
comprises:

an arrangement for permitting the selection, for observation, of a video mode
of multimedia content; and

an arrangement for selectably adding audio to the video mode for observation.

10. A method of managing multimedia content, said method comprising the
steps of:

supplying multimedia content;

permitting the selection, for observation, of at least one of the following modes
associated with the multimedia content: an audio portion that includes video; and a
video portion that includes audio; and

annotating observations of a selected mode.

11. The method according to Claim 10, wherein said step of permitting 
selection comprises permitting the selection, for observation, of both of the following 
associated with the multimedia content: an audio portion that includes video; and a 
video portion that includes audio. 

12. The method according to Claim 10, wherein said step of permitting
selection additionally comprises permitting the selection, for observation,
of solely a video portion of multimedia content.

13. The method according to Claim 10, wherein said step of permitting selection
comprises permitting the selection, for observation, of solely an audio portion of
multimedia content.

14. The method according to Claim 10, wherein said step of supplying
multimedia content comprises providing a working memory which stores multimedia
files.

15. The method according to Claim 10, wherein said step of permitting
selection comprises: first permitting the selection of a multimedia file and then
permitting the selection of said at least one of: an audio portion simultaneously with
video; and a video portion simultaneously with audio.

16. The method according to Claim 10, further comprising the step of
providing a working memory for saving the annotated observations of a selected
mode.

17. The method according to Claim 10, wherein said step of permitting
selection comprises permitting the selection, for observation, of at least the following
mode associated with the multimedia content: a video portion that includes audio.

18. The method according to Claim 17, wherein said step of permitting
selection comprises:

permitting the selection, for observation, of a video mode of multimedia
content; and

thereafter enabling the addition of audio to the video mode for observation.

19. A program storage device readable by machine, tangibly embodying a
program of instructions executable by the machine to perform method steps for
managing multimedia content, said method comprising the steps of:

supplying multimedia content;

permitting the selection, for observation, of at least one of the following modes
associated with the multimedia content: an audio portion that includes video; and a
video portion that includes audio; and

annotating observations of a selected mode.